File format | 1000 Genomes

Can I convert VCF files to PLINK/PED format?

Answer:

We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.

An example Perl command to run the script would be:

perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz \
    -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel \
    -region 13:32889611-32973805 -population GBR -population FIN

What are your filename conventions?

Answer:

Our filename conventions depend on the data format being named. The conventions for fastq, variant and alignment files are each described in more detail in their own questions below.

What is a bas file?

Answer:

Bas files contain statistics that we generate for our alignment files and distribute alongside them.

They hold read group level statistics in a tab-delimited format and are described in this README.

Each mapped and unmapped BAM file has an associated bas file. We also provide them collected together into a single file in the alignment_indices directory, dated to match the alignment release.
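Because bas files are tab delimited, they can be loaded with a generic reader. The following is a minimal sketch, not a description of an official tool. It assumes that the first line of the file is a header row naming the columns; the function name `read_bas` and the column names in the usage example are hypothetical, so check the README for the real columns.

```python
import csv
import io


def read_bas(handle):
    """Return one dict per read group row from a tab-delimited bas file.

    Assumption: the first line is a header row with the column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


# Usage with an in-memory stand-in for a bas file (column names are made up).
example = io.StringIO("readgroup\ttotal_reads\nRG1\t100\nRG2\t250\n")
rows = read_bas(example)
print(len(rows), rows[0]["readgroup"])
```

All values come back as strings, so numeric columns need converting before any arithmetic.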

What do the names of your fastq files mean?

Answer:

Our sequence files are distributed in gzipped fastq format.

Our files are named with the SRA run accession, E?SRR000000.filt.fastq.gz. All the reads in the file also hold this name. Files with _1 and _2 in their names come from paired-end sequencing runs. If there is also a file with no number in its name, it contains the fragments whose other end failed QC. The .filt in the name indicates that the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.
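The naming rule above can be sketched as a small parser. This is an illustration under stated assumptions: the accession pattern is read as ERR or SRR followed by digits, and the function name `parse_fastq_name` is hypothetical.

```python
import re

# Assumption: the accession is "ERR" or "SRR" plus digits, optionally followed
# by a _1 or _2 mate suffix, and always ending in ".filt.fastq.gz".
FASTQ_NAME = re.compile(
    r"^(?P<run>[ES]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Split a fastq file name into its run accession and mate number."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognised fastq name: {filename}")
    mate = match.group("mate")
    # mate is None for the file that carries no _1 or _2 suffix.
    return {"run": match.group("run"), "mate": int(mate) if mate else None}


print(parse_fastq_name("SRR000000_1.filt.fastq.gz"))
print(parse_fastq_name("SRR000000.filt.fastq.gz"))
```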

What do the names of your variant files mean and what format are the files?

Answer:

Our variant files are distributed in vcf format, a format initially designed for the 1000 Genomes Project which has seen wider community adoption.

The majority of our vcf files are named in the form:

**ALL.chrN|wgs|wex.2of4intersection.20100804.snps|indels|sv.genotypes.analysis_group.vcf.gz**

The name starts with the population that the variants were discovered in; if ALL is specified, all the individuals available at that date were used. Next comes the region covered by the call set, which can be a chromosome, wgs (meaning the file contains at least all the autosomes) or wex (meaning the whole exome). This is followed by a description of how the call set was produced or who produced it, and a date that matches the sequence and alignment freezes used to generate the variant call set. The next field describes what type of variant the file contains, followed by the analysis group used to generate the variant calls, which should be low coverage, exome or integrated. Finally, the name contains either sites or genotypes. A sites file contains just the first 8 columns of the VCF format, while a genotypes file contains individual genotype data as well.

Release directories should also contain panel files, which describe which individuals the variants have genotypes for and which populations those individuals are from.
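The difference between sites and genotypes files shows up in the column header line of the VCF itself. The sketch below checks the 8 fixed columns and returns any sample names that follow; the function name `samples_in_header` and the sample IDs in the usage example are hypothetical.

```python
# The 8 fixed columns that every VCF file carries.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def samples_in_header(header_line):
    """Return the sample names from a VCF '#CHROM' header line.

    A sites file has only the 8 fixed columns, so the result is empty.
    A genotypes file adds a FORMAT column followed by one column per sample.
    """
    columns = header_line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("not a VCF column header line")
    # columns[8], when present, is FORMAT; the sample names start after it.
    return columns[9:]


sites_header = "\t".join(FIXED_COLUMNS)
genotypes_header = "\t".join(FIXED_COLUMNS + ["FORMAT", "NA00001", "NA00002"])
print(samples_in_header(sites_header))
print(samples_in_header(genotypes_header))
```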

What format are your alignments in and what do the names mean?

Answer:

All our alignment files are in BAM format, a standard alignment format which was defined by the consortium and has since seen wide community adoption. We also provide our alignments in CRAM format.

The bam file names look like:

NA00000.location.platform.population.analysis_group.YYYYMMDD.bam

The bai index and bas statistics files are also named in the same way.

The name includes the individual sample ID, followed by where the sequence is mapped to. If the file contains only mappings to a particular chromosome, the name contains that chromosome; otherwise mapped means the whole genome mapping and unmapped means the reads which failed to map to the reference (pairs where one mate mapped and the other didn’t stay in the mapped file). Next come the sequencing platform, the ethnicity of the sample using our three-letter population code, and the sequencing strategy. The date matches the date of the sequence used to build the BAMs and can also be found in the sequence.index filename.
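The six-field pattern shown above can be split mechanically. This is a minimal sketch that assumes a name follows that pattern exactly; names with extra dot-separated fields are rejected rather than guessed at. The function name `parse_bam_name` and the example file name are hypothetical.

```python
from datetime import datetime

# Field order taken from the pattern
# NA00000.location.platform.population.analysis_group.YYYYMMDD.bam
FIELDS = ("sample", "location", "platform", "population", "analysis_group", "date")


def parse_bam_name(filename):
    """Split a BAM file name into the fields of the documented pattern."""
    suffix = ".bam"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .bam file name: {filename}")
    parts = filename[: -len(suffix)].split(".")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # Validate the date and convert it from YYYYMMDD text to a date object.
    record["date"] = datetime.strptime(record["date"], "%Y%m%d").date()
    return record


# Made-up name that follows the pattern, for illustration only.
print(parse_bam_name("NA00000.mapped.ILLUMINA.GBR.low_coverage.20110521.bam"))
```

The same function would work for the bai and bas names after swapping the suffix, since they are named in the same way.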

Why is the sequence data distributed in 2 or 3 files labelled SRR_1, SRR_2 and SRR?

Answer:

We distribute the fastq files for our paired-end sequencing in 2 files: mate1 is found in the file labelled _1 and mate2 in the file labelled _2. The files which do not have a number in their name contain single-ended reads. This can be for two reasons. First, some sequencing early in the project was single-ended. Second, we filter our fastq files as described in our README, and if one of a pair of reads gets rejected, the other read gets placed in the single file.
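When working with a directory listing, it can help to collect the 2 or 3 files of each run together. The sketch below groups file names by run accession and labels each file's role. It assumes accessions start with ERR or SRR; the function name `group_by_run` and the role labels are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: "ERR" or "SRR" plus digits, an optional _1/_2 suffix,
# then ".filt.fastq.gz".
NAME = re.compile(r"^(?P<run>[ES]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$")
ROLES = {"1": "mate1", "2": "mate2", None: "single"}


def group_by_run(filenames):
    """Map each run accession to its mate1, mate2 and single files."""
    runs = defaultdict(dict)
    for name in filenames:
        match = NAME.match(name)
        if match is None:
            continue  # skip anything that is not a filtered fastq file
        runs[match.group("run")][ROLES[match.group("mate")]] = name
    return dict(runs)


listing = [
    "SRR000000_1.filt.fastq.gz",
    "SRR000000_2.filt.fastq.gz",
    "SRR000000.filt.fastq.gz",
    "README",
]
for run, files in group_by_run(listing).items():
    print(run, sorted(files))
```

A run that has only a single entry came from single-ended sequencing, while a run with all three entries is a paired-end run where some reads lost their mate during filtering.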

IGSR: The International Genome Sample Resource

Supporting open human variation data
